Authorship Attribution and Verification with Many Authors and Limited Data
Abstract
Most studies in statistical or machine-learning-based authorship attribution focus on two or a few authors. This leads to an overestimation of the importance of the features extracted from the training data and found to be discriminating for these small sets of authors. Most studies also use training-data sizes that are unrealistic for the situations in which stylometry is applied (e.g., forensics), and thereby overestimate the accuracy of their approach in these situations. A more realistic interpretation of the task is as an authorship verification problem, which we approximate by pooling data from many different authors as negative examples. In this paper, we show, on the basis of a new corpus with 145 authors, how having many authors affects feature selection and learning, and we show that a memory-based learning approach is robust for authorship attribution and verification with many authors and limited training data when compared to eager learning methods such as SVMs and maximum entropy learning.
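The verification setup described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the character-trigram features, the cosine similarity, the value of k, and all function names are assumptions chosen for brevity, and a plain k-nearest-neighbour vote stands in for memory-based learning.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in lowercased text (assumed feature set)."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_verification_set(corpus, target):
    """Label the target author's texts True; pool every other author as False."""
    return [(char_ngrams(text), author == target)
            for author, texts in corpus.items()
            for text in texts]


def verify(training, text, k=3):
    """Majority vote among the k stored examples most similar to the query text."""
    query = char_ngrams(text)
    ranked = sorted(training, key=lambda ex: cosine(query, ex[0]), reverse=True)
    votes = [label for _, label in ranked[:k]]
    return sum(votes) > len(votes) / 2
```

Here `corpus` is a hypothetical dictionary mapping author names to lists of texts. Because every stored example is kept and compared at query time, nothing is abstracted away during training, which is the property that distinguishes memory-based (lazy) learning from eager learners such as SVMs.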
Similar resources
Text Categorization for Authorship Verification
Abstract. One common version of the authorship attribution problem is that of authorship verification. We need to determine whether a given author, for whom we have a corpus of writing samples, is also the author of a given anonymous text. The set of alternate candidates is not limited to a given finite closed set. In this paper we show how usual text categorization methods can be adapted to so...
Large Scale Authorship Attribution of Online Reviews
Traditional authorship attribution methods focus on the scenario of a limited number of authors writing long pieces of text. These methods are engineered to work on a small number of authors and generally do not scale well to a corpus of online reviews where the candidate set of authors is large. However, attribution of online reviews is important as they are replete with deception and spam. We...
Mapping co-authorship network of Iranian researchers in the field of knowledge management
Background and aim: Many studies have examined co-authorship among the authors of universities and organizations, one of the most important topics in scientometrics, across various fields and disciplines. The aim of this study was to map the co-authorship network of Iranian researchers in the field of knowledge management in Web of Science (WoS). Material a...
Computational methods in authorship attribution
Statistical authorship attribution has a long history, culminating in the use of modern machine learning classification methods. Nevertheless, most of this work suffers from the limitation of assuming a small closed set of candidate authors and essentially unlimited training text for each. Real-life authorship attribution problems, however, typically fall short of this ideal. Thus, following de...
Authorship Attribution Using Text Distortion
Authorship attribution is associated with important applications in forensics and humanities research. A crucial point in this field is to quantify the personal style of writing, ideally in a way that is not affected by changes in topic or genre. In this paper, we present a novel method that enhances authorship attribution effectiveness by introducing a text distortion step before extracting st...